DoorDash Case

Project Overview

Build an ML model to predict the total delivery duration in seconds, defined as the time from order creation by the customer (created_at) to delivery (actual_delivery_time).

Underestimating delivery time is roughly twice as costly as overestimating it. Orders that are very early / late are also much worse than those that are only slightly early / late.

Detailed dictionary of the data is in the appendix.

1. Define Objective

The following asymmetric MSE loss function is used (see reference).

$$L = \frac{1}{2n} \sum_{i=1}^n \left| \alpha - \mathbb{1}_{(g(x_i) - \widehat{g}(x_i)) < 0} \right|\cdot \left(g(x_i) - \widehat{g}(x_i)\right)^2$$

Denote $\epsilon_i = g(x_i) - \hat g(x_i) = y_i - \hat g(x_i)$. The Jacobian and Hessian are

$$J = \left\{\begin{matrix} (\alpha - 1)\epsilon_i & \text{ if }\epsilon_i<0\\ -\alpha \epsilon_i & \text{ if }\epsilon_i\geq 0 \end{matrix}\right. $$
$$H = \left\{\begin{matrix} 1 -\alpha & \text{ if }\epsilon_i<0\\ \alpha & \text{ if }\epsilon_i\geq 0 \end{matrix}\right. $$

Since late delivery (positive residuals, i.e., under-prediction) is twice as costly as early delivery, we can set $\alpha = 2/3$. The custom loss function is then
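The Jacobian and Hessian above translate directly into code. A minimal sketch, assuming LightGBM's scikit-learn custom-objective convention (a callable taking `y_true, y_pred` and returning `grad, hess`):

```python
import numpy as np

ALPHA = 2 / 3  # weight on under-prediction (late delivery)

def asymmetric_mse_objective(y_true, y_pred):
    """Custom objective: gradient and Hessian of the asymmetric squared
    loss with respect to y_pred, matching the J and H derived above."""
    eps = y_true - y_pred
    weight = np.where(eps < 0, 1 - ALPHA, ALPHA)
    grad = -weight * eps  # (alpha - 1) * eps if eps < 0, else -alpha * eps
    hess = weight         # 1 - alpha if eps < 0, else alpha
    return grad, hess
```

With the scikit-learn API this could be passed as `lgb.LGBMRegressor(objective=asymmetric_mse_objective)` (the notebook's exact wiring is not shown, so treat this as an assumption).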

2. Importing and Data Cleaning

2.1 Create Potentially Useful Variables

Not all of them will be used in the final model.

2.2 Data Sanity Check

2.2.1 Duration Outliers - Can it take more than one day to deliver?

One record stands out: it was created on 2014-10-19, while all the others are from 2015! This must be a data error, so I will drop it.

Now, check the top few records with the highest duration. Two records take more than one day! That is probably due to reservations (i.e., orders submitted ahead of time). They are not the focus of this study, since we focus on real-time, same-day delivery (an assumption made for simplicity), so I will drop them as well.

Strangely, there are 7 records with missing delivery time. Sometimes this is a censoring issue: the most recent orders have not been delivered yet, so their delivery times are unobserved. However, that is not the case here, since these 7 orders span several different days (it is hard to imagine a delivery taking more than a day!).

As a result, this may just be a data issue and can be ignored.

The other records look reasonable.

2.2.2 Negative Values Where They Shouldn't Be

There are negative values in 'min_item_price', 'total_onshift_dashers', 'total_busy_dashers', and 'total_outstanding_orders', which makes no sense. Since there are only a few, I will set them to NaN for now; ideally, I would go back and check why these entries are incorrect.
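A minimal sketch of that replacement using pandas `mask` on a toy frame (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with two of the affected columns (values are illustrative).
df = pd.DataFrame({
    "min_item_price": [100, -50, 200],
    "total_onshift_dashers": [5, 3, -1],
})

cols = ["min_item_price", "total_onshift_dashers"]
# mask() replaces entries where the condition is True with NaN
df[cols] = df[cols].mask(df[cols] < 0)
```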

2.2.3 Missing Values

The first thing is to check if there are many missing values.

The only concern is the high percentage (8%) of missing values for total_onshift_dashers, total_busy_dashers, and total_outstanding_orders.

Is it because zero values are recorded as missing? Let's see:

Nope! Zeros are present in the data, so we may need to impute the missing values later. But before that, it is crucial to determine whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). We can infer this by comparing the mean duration of observations with and without missing values.
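One way to make that comparison, sketched on toy data (column names follow the ones above):

```python
import numpy as np
import pandas as pd

# Toy data mirroring the relevant columns.
df = pd.DataFrame({
    "duration": [2000.0, 2500.0, 2100.0, 2400.0, 2300.0, 2600.0],
    "total_onshift_dashers": [5.0, np.nan, 7.0, np.nan, 3.0, 8.0],
})

miss = df["total_onshift_dashers"].isna()
mean_missing = df.loc[miss, "duration"].mean()
mean_present = df.loc[~miss, "duration"].mean()
# Similar means across the two groups are consistent with MCAR/MAR;
# a large gap would suggest the missingness depends on the outcome.
```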

Summary

Therefore, we could regard them as either MCAR or MAR, which makes imputation easier.

2.2.4 Create other variables now that data is cleaned

So the data mainly spans 2015-01-21 to 2015-02-18, about 28 days. If I use the previous 7 days' average delivery time (to smooth out day-of-week seasonality), I lose 25% of the data. Therefore, I will not use a historical moving average here.

3. EDA

Let's see what variables we have at this point!

3.1 Dependent Variable

The duration is obviously skewed with a long right-tail. After taking log, the distribution becomes more symmetric.

3.2 Categorical Features

3.2.1 Count

Most orders occur between 18:00-21:00 and 01:00-04:00 (UTC).

Very few orders for protocol 6 and 7.

3.2.2 Relation to duration

3.3 Numerical Features

3.3.1 Summary Stats

Note that the derived fields total_free_dasher and total_free_dasher_percent are sometimes negative. Does this mean the dashers are "over-subscribed"?

3.3.2 Correlational Analysis

These variables seem very important:

There are two groups of highly correlated features (correlation > 0.6):

To avoid inverting a near-singular matrix in linear regression, I will use Ridge regression later to improve numerical stability.

4. Feature Engineering

I will use the following list to keep track of the variables.

4.1 Encoding Categorical Features

The target encoding code is adapted from this link. Note that I don't want to use TimeSeriesSplit for target encoding, because the latest time-series fold is never used for encoding, while the earlier ones are used multiple times.

4.1.1 High Cardinal Features

There are three high-cardinality features:

I will use target encoding with CV to encode these variables.

The distribution of duration seems wide, so we have some variation.

Check the number of orders per category. With CV target encoding, a category with few observations will be encoded close to the overall average duration.
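A sketch of the out-of-fold (CV) target encoding idea, not the exact code from the link; `n_splits` and the fallback to the global mean for unseen categories are my assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cv_target_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row's category is encoded with
    the target mean computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).to_numpy()
    # categories unseen in a fold's training part fall back to the mean
    return encoded.fillna(global_mean)
```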

**Feature Engineering 1**

4.1.2 Low Cardinality Categorical Features

I could use one-hot encoding, but since lightgbm handles integer-coded variables well, I will keep them as-is.

4.2 Bin and Encode Similar Numeric Features

The following numeric variables also measure market capacity, yet some have many missing values.

One option is dimensionality reduction (e.g., PCA) to combine them, but that only considers linear combinations. Here, I will use a different method:

There is an ordering relationship between these variables and duration. I will therefore use target encoding again to help the model learn this feature.

This also solves their missing-value problem. Nice!
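A sketch of the bin-then-encode idea on a toy frame (column names assumed). Quantile bins plus an explicit "missing" category mean the NaNs need no separate imputation; the encoding here uses a plain in-sample mean, where the real version would use the out-of-fold scheme:

```python
import numpy as np
import pandas as pd

# Toy frame: a market-supply feature with missing values (names assumed).
df = pd.DataFrame({
    "total_outstanding_orders": [2.0, 5.0, np.nan, 40.0, 40.0, 12.0, 7.0, np.nan],
    "duration": [1800.0, 2000.0, 2100.0, 3000.0, 3100.0, 2400.0, 2050.0, 2200.0],
})

# Quantile bins; NaN becomes its own "missing" category.
bins = pd.qcut(df["total_outstanding_orders"], q=3, duplicates="drop")
df["orders_bin"] = bins.cat.add_categories("missing").fillna("missing")

# Encode each bin with its mean duration (in-sample for brevity).
df["orders_enc"] = (df.groupby("orders_bin", observed=True)["duration"]
                      .transform("mean"))
```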

**Feature Engineering 2**

4.3 Impute Missing Values

4.3.1 MCAR and Low Missing Percentage

Market ID is an important identifier and will affect subsequent analysis. For simplicity, I will fill it with the most frequent category. This is fine because less than 0.5% of market IDs are missing, and there is no difference in average duration.

The only variables with a significant percentage missing are 'total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders', and 'total_order_per_onshift_dasher'.

The rest I will just leave for lightgbm to handle.

I will use the IterativeImputer in Scikit-Learn. However, I need to one-hot encode the remaining cat variables first.
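A minimal sketch of the imputation step; note that `IterativeImputer` is still experimental in scikit-learn and needs an explicit enabling import. The frame here is a numeric toy stand-in (in the real pipeline, the remaining categorical variables would be one-hot encoded first, as noted above):

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Numeric toy stand-in for the columns with missing values.
df = pd.DataFrame({
    "total_onshift_dashers": [5.0, np.nan, 7.0, 3.0],
    "total_busy_dashers": [4.0, 2.0, np.nan, 1.0],
    "total_outstanding_orders": [6.0, 3.0, 9.0, np.nan],
})

imputer = IterativeImputer(random_state=0, max_iter=10)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```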

4.4 Create Moving Averages

Time series data usually have high auto-correlation. Historical averages are thus very useful.

Calculate moving averages over the last week by market_id and time_slot.
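One possible way to compute such a trailing average, sketched on a toy frame grouped by market only (the real version also groups by time_slot); the `shift(1)` keeps each day from seeing its own data:

```python
import pandas as pd

# Toy order-level data (column names are assumptions).
df = pd.DataFrame({
    "market_id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2015-01-21", "2015-01-22", "2015-01-23",
                            "2015-01-24", "2015-01-21", "2015-01-22"]),
    "duration": [2000.0, 2200.0, 2100.0, 2300.0, 1800.0, 1900.0],
})

# Daily mean per market, then a trailing 7-day average, shifted by one
# day so each row only uses strictly past information.
daily = df.groupby(["market_id", "date"])["duration"].mean().reset_index()
daily["ma_7d"] = (
    daily.groupby("market_id")["duration"]
         .transform(lambda s: s.shift(1).rolling(7, min_periods=1).mean())
)
```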

Now merge with previous data

**Feature Engineering 3**

Impute the missing moving averages.

5. Model Selection and Tuning

I will use two sets of models: one is a linear model (Ridge regression), and the other is gradient boosting (LightGBM).

5.1 Tuning

5.1.1 Ridge Regression

Use Ridge regression to address the high collinearity noted earlier (Ridge improves numerical stability).

5.1.2 LightGBM

I suspect the default parameters are good enough. Nonetheless, I will use Bayesian Optimization to tune the hyperparameters. I will only tune the most important ones.

Use Early Stopping

I first use early stopping to automatically select num_iterations and focus on the other parameters. This usually speeds up tuning.

Tune num_iterations

Given the other parameters, pin down num_iterations (univariate search)

5.2 Model Selection Using CV

5.2.1 Ridge Regression

5.2.2 LightGBM

5.3 Model Evaluation

5.3.1 Overall Performance

I will look at the symmetric RMSE and asymmetric RMSE where under-prediction is penalized twice as much as over-prediction.
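The asymmetric RMSE can be sketched as a small helper (the 2:1 penalty corresponds to $\alpha = 2/3$ from Section 1):

```python
import numpy as np

def asymmetric_rmse(y_true, y_pred, alpha=2/3):
    """RMSE where under-prediction (late delivery) carries weight alpha
    and over-prediction carries weight 1 - alpha (2:1 for alpha=2/3)."""
    eps = np.asarray(y_true) - np.asarray(y_pred)
    weight = np.where(eps < 0, 1 - alpha, alpha)
    return float(np.sqrt(np.mean(weight * eps ** 2)))
```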

The optimized LightGBM model performs the best.

5.3.2 Residual Distribution

Ridge regression generally under-predicts, so its residual distribution sits to the right of the other two.

5.3.3 Time Series Plot

Lastly, I will look at the total delivery time per day.

6. Model Prediction and Interpretation

6.1 Train the selected model on all training data

6.2 Sanity Check & Insample Fit

6.3 Interpretation

We can also see the partial dependence (directional) here:

7. Prediction and Save

Appendix

Data Dictionary

Time features

Store features

Order features

Market features

The following features are values at the time of created_at (order submission time)

Predictions from other models:

We have predictions from other models for various stages of the delivery process that we can use.